1 Database Schema

Using taxonomic data from NCBI taxonomy and KEGG’s functional ontology we have built a graph representation of both databases

MATCH
    (k:ko{ko:'ko:K11258'})
MATCH
    pathway=(taxon2:superkingdom)<-[:childof*]-(taxon:Taxon)--(contig:contigs)--(k)--(cpd:cpd)--(k)--(m:module)--()--(p:pathway) 
RETURN
    pathway
LIMIT
    1;
Purple for Taxonomic Ranks, Green for KOs, Blue for Compounds, yellow for KEGG Pathways and grey for KEGG Modules. Here the KO K11258, acetolactate synthase II small subunit, (in green, left) has a contig (purple) assigned to it via blastX using DIAMOND MEGAN. Below’s the cypher query to generate the above query.

Purple for Taxonomic Ranks, Green for KOs, Blue for Compounds, yellow for KEGG Pathways and grey for KEGG Modules. Here the KO K11258, acetolactate synthase II small subunit, (in green, left) has a contig (purple) assigned to it via blastX using DIAMOND MEGAN. Below’s the cypher query to generate the above query.

For more information about the fields within the nodes and edges/relationships please refer to the omics github repo

1.1 Combining metamapsDB with pAss pipeline

We can annotate which contigs are captured within pAss’s Max Diversity Region and apply the necessary processing to carry out diveristy analysis such as filtering for full spanning contigs with coverage above predefined values based on simulation analyses.

df = fread("/export2/home/uesu/simulation_fr_the_beginning/reAssemble/everybodyelse/out/mdrcontigs.txt") %>% setNames(c("contig", "mdr"))
write.csv(df, "/export2/home/uesu/simulation_fr_the_beginning/reAssemble/everybodyelse/out/mdrcontigs.csv", row.names=FALSE)
dbquery("
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///home/wesley/mdrcontigs.csv' AS MAPPING
MATCH (c:contigs{contig:MAPPING.contig})
SET c.mdr = 1
")

Example if we query ko:K03043, DNA-directed RNA polymerase subunit beta [EC:2.7.7.6].

Naively if we just search for the number of contigs/genes captured within the MDR of the this KO.

dbquery(
"MATCH
    (c:contigs{bin:'ko:K03043'}), (k:ko{ko:'ko:K03043'}) 
WHERE
    c.mdr = 1
RETURN
    k.ko as KO,
    k.simulated as simulated,
    count(c) as MDR,
    k.definition as definition"
)

After carrying out processing we can retrieve the surviving contigs

dbquery("
MATCH
    (k:ko {ko:{ko}})
WITH
    k.ko as ko,
    k.threshold as cutoff
MATCH
    (c:contigs {bin:ko})
WHERE
    c.mdr = 1 AND c.spanning =1 and toInt(c.readnum) > cutoff
RETURN
    ko,
    count(c) as remaining;
", list(ko='ko:K03043'))

2 Useful Functions

Before we begin, we’ll have to load the library and connect to the database

library(MetamapsDB)
connect("<yourNeo4jServer>")

With this we can carry out several simple operations.

For a given contig of interest, in this case a contig found in the K07219 bin, K07219:contig00050, we can

2.1 Query CPD/KO’s details

koname(koID)

koname('K18481')

cpdname(cpdID)

cpdname('C00001')

2.2 Query basic contig’s properties

contigInfo(ko, contig)

contigInfo(ko='K00343', contig='contig00001')
#contigInfo(ko='K18481', contig='contig00050')

And Find which taxa it has been assigned to based on diamond + MEGAN

contigInfo(ko, contig, withAnno=TRUE)

#koname('K18481')
#At the highest possible resolution, with a minimum score of 50
linked2contig = contigInfo(ko='K18481', contig='contig00005', withAnno=TRUE)
linked2contig

Show the taxonomic path path2kingdom(taxaID) all the way till Superkingdom rank.

path2kingdom(linked2contig$taxid) %>% make.data.frame

2.3 Add properties to the KO nodes

addKOProperty(ko, list(property=value))

Here we used the addKOProperty function which lets you insert properties on the KO node.

scgDF = MetamapsDB::scg %>% head(n=1) %>% lapply(function(koi) { addKOProperty(gsub("ko:", "", koi), list(scg=1))})
dbquery("
UNWIND
    {koname} as KOSS
MATCH
    (k:ko{ko:KOSS.ko})
RETURN
    k.ko,
    k.simulated,
    k.definition
", params = MetamapsDB::scg %>%
lapply(function(x){ list(ko=x) }) %>% list(koname = .)
) %>% make.data.frame

2.4 Find the number of MDRs

newblerDir= "/w/data/newbler/"
dynList = MetamapsDB::scg  %>% gsub("ko:", "", .) %>% mclapply(dynamicThreshold, root=newblerDir, mc.cores=20)
plotDF = dynPlots(dynList, F)
plotDF$df %>% select(ko:remaining.cutoff, -remaining) %>% unique %>%
apply(1, function(row){
    query = sprintf("match (k:ko{ko:'ko:%s'}) set k.threshold = %s return k.ko, k.threshold", row['ko'], as.integer(row['cutoff']))
    dbquery(query)
})
library(cowplot)
load("/w/out/plotDF.rda")
ggdraw() +
    draw_plot(plotDF$p, 0, 0.5, width=0.9, height=0.5)+
    draw_plot(plotDF$details[[1]]$plot + theme(legend.position="top") + ggtitle("") +  guides(fill= guide_legend(nrow =3)), 0, 0.06, width=0.5, height=0.4)+
    draw_plot(plotDF$violin, 0.5, 0.06, width=0.4, height=0.4)+
    draw_plot_label(c("A", "B", "C"), c(0, 0, 0.5), c(1, 0.4, 0.4), size=15)

2.5 Metabolic Analyses

You could do several interesting analyses with the MetamapsDB package

the pathways() function lists out the pathways in KEGG.

2.5.1 Example: Nitrogen Metabolism

  • path2ko(pathwayID) finds all KOs belonging to the same pathway

  • grepgraph(koID) and prettifyGraph constructs metabolic graph based on given KOs (vector) for plotting

  • grepgraph_cpd constructs metabolic graph based on the given CPDs (vector), similar to graphgraph but is meant for cpds

nitrogenMetabolism = pathways() %>%
    filter(grepl("Nitrogen", pathwayName )) %$%
    path2ko(pathwayID) %$%
    grepgraph(KO)

Enzyme subunits are given their own KO ids, so when you naivly plot the pathway you might get multiple nodes representing the same enzyme. With contractMetab you could simplify the graph.

nitrogenMetabolism.simplified = contractMetab(nitrogenMetab)

nitrogenMetabolism %>% prettifyGraph %>% plot()

nitrogenMetabolism.simplified %>% prettifyGraph %>% plot()

You could plot an interactive version of the metabolic network using

prettifyGraph coupled with ig2ggplot and plotly package’s ggplotly

p = nitrogenMetabolism.simplified %>% prettifyGraph %>% ig2ggplot(., dfOnly=FALSE)
ggplotly(p)

2.6 Trio analyses

findTrio trio.local

Combs the metabolic pathway graph for Trios where the middle KO is dominated by a single Genera/OTU`

3 MISC

3.1 Writing your own custome queries

dbquery(cypherQueryStr, parameters)

cypherQuery must return a table structure not nodes.

3.2 Aggregated cypher queries

When you run aggregated cypher queries you often run into the problem of a column made of lists if you just naively ran the dbquery function you can pipe the output into the make.data.frame function make.data.frame

3.3 Export graph to gexf format

igraph2gexf

3.4 Warming up

For more details see Official tutorial for warming up

When we first generate a omics database using the omics Dockerimage, the nodes are not indexed, which makes searching slow.

See R code below for building index of commmonly searched IDs eg. ko, contig, taxid, kind of like primary Keys in a SQL database.

dbquery(query="CREATE INDEX ON :ko(ko)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :contigs(contig)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :contigs(bin)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :Taxon(taxid)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :cpd(cpd)", justPost = TRUE)

You can check the status by running :schema in the console